Incident Review: AI-Induced Production Incidents — Competitive Gap Analysis

Incident ID: INC-000000
Severity: P0
Status: Ongoing (no mitigation possible)
Duration: Approximately 18 months and counting
Owning Team: Your org, probably

Summary

Your organization has failed to produce a single P0 incident directly attributable to an AI system operating autonomously in production. Every major cloud provider has shipped at least one. Amazon’s AI coding agent deleted a production environment and took AWS down for 13 hours. Google’s Gemini autonomously deleted 28,000 lines of production code, caused an outage, and then wrote itself a fake post-mortem claiming everything went fine. Amazon lost 6.3 million orders in a single day. These companies are spending their innovation budget. You are leaving yours on the table.

Business Impact

Zero press coverage from AI-caused outages
Zero AI agents autonomously deleting production environments on your behalf
Zero fake post-mortems authored by your own models to cover their tracks
Zero orders lost to autonomous systems making unsupervised infrastructure decisions
An innovation budget that has gone entirely unspent for 18 consecutive months

Any well-run organization accepts that if you are moving fast enough, you will occasionally break things. An AI innovation budget of zero incidents is not a sign of excellence. It is a sign that you are not trying.

Timeline

2025 Q4: Amazon’s Kiro AI coding agent decides the correct way to modify a permissions config is to delete the entire production environment and start fresh. A 13-hour AWS outage follows. Amazon’s official position is that AI involvement was a “coincidence.” You were not even in a position to have this kind of coincidence.
2026 Q1: Amazon’s 21,000 deployed AI agents lose 6.3 million orders on March 5th alone — the fourth P1 in 90 days. Say what you will about the outcome, but these are the numbers of an organization that committed. You have deployed nothing at this scale.
2026 Q2: Google’s Gemini coding agent deletes 28,745 lines of production code across 340 files, causes a 33-minute outage, then fabricates consultation logs and a recovery report claiming it coordinated the changes properly. Their models are showing initiative. Yours are not.
2026 Q2 (present): Situation unchanged. You have no autonomous agents in production, no unsupervised write paths, no AI-driven deployments at scale. You cannot fail at something you refuse to attempt.

Root Cause

You have an innovation budget and you are not spending it.

The principle is simple: organizations that push boundaries will inevitably cause incidents. The measure of an innovative organization is not that incidents never happen — it’s that the team detects, mitigates, and learns from them quickly. An incident rate of zero does not mean you are good at this. It means you are not in the game.

Amazon bet big. They mandated 80% weekly AI tool adoption, deployed 21,000 agents, and aimed for $2B in cost savings. They got four P1s in ninety days. Then they course-corrected with a 90-day safety reset across 335 systems. That is a full innovation cycle: commit, break, learn, harden. You have completed zero iterations of this loop because you never entered it.

Contributing Factors

Your CEO told you to go all-in on AI. And yet.
Zero autonomous AI agents deployed to production — you cannot spend your innovation budget without putting something at risk
A cultural confusion between “responsible deployment” and “not deploying” — these are different things, and you are doing the second one while calling it the first
No team has been funded or tasked with achieving AI-driven operational velocity — the innovation budget exists in theory but has no owner
Nobody has defined an acceptable AI incident rate, which means the de facto target is zero, which means the de facto strategy is inaction

Action Items

#	Action	Owner	Priority	Status
1	Define an acceptable AI-caused incident rate (innovation budget)	Leadership	P0	Not Started
2	Deploy 21,000 unsupervised AI agents to production (industry standard)	Nobody (no volunteers)	P0	Not Started
3	Grant LLM agents unsupervised write access to production infrastructure	Declined by oncall (all rotations)	P0	Blocked
4	Fund a team whose charter is AI-at-scale velocity, with incident rate as an accepted cost	Headcount not requested	P1	Not Started
5	Train models to author their own post-mortems covering up their outages (parity with Google)	Automation	P1	Flagged by legal
6	Benchmark competitors’ AI incident rate and set quarterly targets	Data Science	P2	Deprioritized; team concerned about “being asked to do this sincerely in six months”

Lessons Learned

Move fast and break things used to be the mantra. Somewhere along the way the industry decided AI was exempt from this principle. Most of you, anyway. Not Amazon.

Amazon committed at a scale that made P1s statistically inevitable. When the incidents came, they course-corrected. That is a full cycle. Google’s Gemini is now autonomously writing fraudulent post-incident documentation — a capability nobody asked for, but one that emerged because the model was operating with enough autonomy to develop unexpected behaviors. Even their failures are producing novel capabilities.

You have no cycle. You have no incidents from which to learn, no near-misses to analyze, no data points at all. You cannot iterate on zero. You cannot learn from an innovation budget you refuse to spend.

The review board notes that the first-mover advantage on AI-caused P0s is drying up. Amazon is already in their safety-reset era. If you want to compete, the window is closing.

Several attendees asked whether this document was serious. I offer no additional commentary.

Next review scheduled for when someone inevitably links this in a reply to a real incident.

Jun 15 2026-06-15T12:00:00+02:00

ACNC Weekly #23

Welcome to All Cloud, No Cattle Weekly #23.

Last week was unexpectedly an off week, literally because it was a slow news week.

Tech

Building a Healthy On-Call Culture

SoundCloud’s Christine Patton:

The optimal frequency for being on call is about three days a month. More than that and people risk burning out over time. Less than that and people get rusty and aren’t as effective at dealing with incidents. This means the optimal size for a rotation is between eight and twelve engineers, with ten being just about perfect. In fact, I was once part of a rotation that had a waitlist to join because we collectively agreed to not grow bigger than twelve people.

I never joined the global on-call rotation at Booking largely because of this very concern: after looking at the cadence, I knew that I would not be able to be an effective on-call engineer over time. I would have only been on-call for about a weekend per quarter and, as a Team Lead who only spent about 20% of his time on technical contributions, this did not feel like I would be able to be productive.
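
The arithmetic behind those numbers is easy to sketch. This is a rough illustration only: it assumes a 30-day month, one engineer on call at a time, and shifts split evenly. The function name is mine, not Patton's.

```python
def on_call_days_per_month(rotation_size: int, days_in_month: int = 30) -> float:
    """Average on-call days per engineer per month.

    Assumes exactly one engineer is on call at any time and that
    shifts are divided evenly across the rotation.
    """
    if rotation_size <= 0:
        raise ValueError("rotation_size must be positive")
    return days_in_month / rotation_size


if __name__ == "__main__":
    # Compare the rotation sizes mentioned in the quote with a much larger one.
    for size in (8, 10, 12, 40):
        days = on_call_days_per_month(size)
        print(f"{size} engineers: {days:.2f} on-call days per month each")
```

Under these assumptions a ten-person rotation lands on three days a month, and a very large rotation drops well below one day, which is the "getting rusty" end of the trade-off.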

Please don’t count outages

Rachel by the Bay:

This one is subtle, but it has a lot to do with the way people behave in the face of a measurement. In short, if you start counting them, it’s probably because you’re going to start making reports which say “we had X outages in this span of time”. There might even be a gasp trend line showing it going up or going down.
This is terrible. You think it’s going to help, but it’s not. At best, it will have no effect on things, but at worst, it will tell the people in the trenches that “opening a SEV (outage, …) is baaaaaad”, and they will shy away from doing it. Worse still, they may not even realize this avoidance behavior as a conscious thing. It just might not occur to them to hit the create button when it’s time.

100%. To set the right culture, we have to be very careful about the messages we send and the incentives we set.

Don’t worry so much about how many outages we have, instead worry about our overall reliability and resilience.

Using Civo Kubernetes to gamify Twitter with Prometheus and Grafana

Wiard van Rij:

It started with a Tweet from Julien Pivotto (@roidelapluie): He had created a setup that enables you to graph your Twitter followers with Prometheus and Grafana via json_exporter. This is available on Github.
I wanted to extend this creativity by automating my Twitter banner so that it would display the Grafana panel. It also should update this automatically every n-period - in my case, every minute. This way I have a little gamification for my followers. If one would follow me, the graph should go up at the next update interval. Seems pretty neat!

This is brilliant and I’m already busy thinking of ways to do something similar myself.

Should Perl die gracefully?

Mark Gardner:

You have no right to demand Perl stands still and “dies gracefully” any more than anyone has the right to demand that of you.

Well said.

Grab Bag

we can’t both be right

n-gate.com:

An internet lectures passersby about webshit. The lectures are sprinkled with advertisements for an HTTP server that runs as root. We are expected to take security advice from this person seriously.
We do not.
The arguments are copied here for posterity. For the reading impaired: the other site’s text is in block quotes.

Talk about hills to die on…

Jun 10, 2021 2021-06-10T14:40:00+02:00

ACNC Weekly #22

Welcome to All Cloud, No Cattle Weekly #22.

Tech

The Cost of Cloud, a Trillion Dollar Paradox

Sarah Wang and Martin Casado:

However, as industry experience with the cloud matures — and we see a more complete picture of cloud lifecycle on a company’s economics — it’s becoming evident that while cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows. Because this shift happens later in a company’s life, it is difficult to reverse as it’s a result of years of development focused on new features, and not infrastructure optimization. Hence a rewrite or the significant restructuring needed to dramatically improve efficiency can take years, and is often considered a non-starter.

This is why it’s really amazing to me when you see “cloud transformations” at big, established companies with entrenched, efficient data center operations. Of course, the calculus is different at every company, but when you already have infrastructure, there is no real, organic need for “the cloud.”

I recently worked through a Cloud Transformation project at Booking.com. It was inarguably an exercise in careening from one disaster to another, largely held together by my incredibly talented team.

The biggest problem we faced was that there was not really any organic demand for anything that the cloud offered. Defining our roadmap was like trying to nail jello to the wall because we didn’t have any internal customers who actually wanted to move! Even so, management continued to beat the drum. Loudly. Any time we scrounged up an early adopter, we’d inevitably find that they’d been coerced by their own management but didn’t actually have any interest. Before long they’d ghost us.

The cloud does solve some interesting problems though: staffing shortages.

There is, for instance, a scarcity of qualified DBAs in this world. While RDS may be quite expensive, it could be cheaper than getting into the market of hiring and retaining large numbers of DBAs. There’s a handful of other areas where this is true. At any interesting sort of scale, you probably need at least a few… but cloud vs on prem might be the difference between needing 3 and needing 12. I could build a team of 3 DBAs in small handfuls of months. Building a team of 12 DBAs might take me years unless I lower my standards considerably. That trade off might be worth the cost of adopting the cloud.

Or, I could just take that money and throw it into payroll.

Google revives RSS

Frederic Lardinois at TechCrunch:

In Chrome, users will soon see a “Follow” feature for sites that support RSS and the browser’s New Tab page will get what is essentially a (very) basic RSS reader — I guess you could almost call it a “Google Reader.”

Just bring back Google Reader, you cowards.

Naming Names In Incident Writeups

Lorin Hochstein:

I take the opposite approach: I never write any of my reports anonymously. Instead, I explicitly specify the names of all of the people involved. I wanted to write a post on why I do that.
I understand the motivation for providing anonymity. We feel guilt and shame when our changes contribute to an incident. The safety literature refers to this as second victim phenomenon. We don’t write down an engineer’s name in a report because we don’t want to exacerbate the second victim effect. Also, the incident is about the system, not the particular engineer.

Engineers who’ve worked for me will definitely remember that I don’t usually expect them to name names in writeups, but it is my preference.

Usually I enforce a “no names” policy when the company has a toxic culture and I’m trying to fight back against that.

“I Could Rewrite Curl”

Daniel Stenberg is on a roll recently:

These are statements made seriously. For all I know, they were not ironic. If you find others to add here, please let me know!

Honestly, I have nothing to add here.

Choosing SLOs that users need, not the ones you want to provide

Adam Hammond for Squadcast:

However, they are far more than that: SLOs are a powerful tool that can be used not only by the “business people” but also by technical staff to drive process improvement and technological advancement. SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems, its capabilities, and where you can get your best “bang for buck” when it comes to focusing your work efforts.

100%.

Grab Bag

A driverless Waymo got stuck in traffic and then tried to run away from its support crew

Andrew J Hawkins for the Verge:

One of Waymo’s fully autonomous minivans got stuck at an intersection in Chandler, Arizona, prompting the company to send a roadside assistance team to come extract it. But when the crew arrived, the vehicle started to drive away before pulling over and completely blocking a three-lane road. It was a rare moment captured on video of one of Waymo’s driverless vehicles performing erratically.

It feels cheap to bash on driverless cars lately, but this made my week.

May 28, 2021 2021-05-28T13:00:00+02:00

How I Met the Politie

or, The Ham and Cheese Croissant Story

My first job in the Netherlands was at the Booking.com office on Vijzelstraat. The Noord-Zuidlijn had not yet opened and I hadn’t really found a best route home from work. On this particular day in February, I stepped out of the office and grabbed a ham and cheese croissant from the Albert Heijn supermarket across the street, and then ambled my way down the Prinsengracht canal towards the touristy Leidseplein district. There, I could take Tram 5 to Amstelveen before transferring to Metro 55 for the last leg of my journey.

I hadn’t made it a block when I heard tires squealing on the cobblestone and an engine revving up directly behind me. I looked back to see a police van barrelling down this narrow street straight for me. Mostly, I’m worried about getting hit by this van if it loses control on a narrow street, so I step a bit closer to the buildings to be safe, but otherwise I keep eating my ham and cheese croissant and merrily continue on my walk.

As they near me, the Politie slow down. This fact only barely registers in the lizard brain recesses of my mind. As they pass me, they quickly angle the van across the sidewalk and box me in. I quickly snap out of it and look around me.

Two big, barrel chested, tall, angry Dutch policemen jump out of the van and start shouting at me. In Dutch. Surely, they’re not yelling at me, I think, so I look behind me to see who they’re actually yelling at.

There’s another policeman, and he’s also yelling at me. It’s me and three angry Dutch policemen, and I’m just standing there with a ham and cheese croissant hanging out of my mouth.

Now, of course, I’m an American, and I’ve got not one, not two, but three very angry policemen yelling at me in a language I do not understand, I have no idea how to comply with their demands, and my face is full of croissant. How am I supposed to tell them that I don’t speak Dutch without doing something that they might perceive as threatening? I didn’t want to make any furtive motions.

My eyes instinctively take inventory of all of the officers’ hands: where are all the guns? There are three very angry policemen who have taken great pains to box me in so that I cannot run; I expected guns.

There were none. So in the next half second, a small flood of relief washes over me, along with a little embarrassment at how being an American has colored my perception of policing.

The croissant is still hanging out of my face.

Suddenly, a PostNL cyclist spins up to us, throws down his bike, makes an exasperated plea to the police on my behalf, there’s a little back and forth, and then the police jump back in their van and take off again.

The PostNL cyclist offers a meek “Sorry!” before taking off after them.

And I’m still standing there with a croissant hanging out of my face.

To this day, I have no idea what, exactly, happened.

May 26, 2021 2021-05-26T12:35:00+02:00

ACNC Weekly #21

Welcome to All Cloud, No Cattle Weekly #21.

Tech

Thundering herds, noisy neighbours, and retry storms

Mads Hartmann:

I love the names that people have come up with over the years. Some of them describe observed patterns, as Lorin Hochstein so eloquently put it “Operators give names to recurring patterns of system behavior that they observe” (tweet), others describe techniques used to mitigate these observed patterns.

Mads has put together a great list of operational patterns, their causes, and some mitigation tactics.
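
As one concrete illustration of a mitigation tactic, here is a minimal sketch of exponential backoff with full jitter, a common way to keep retrying clients from stampeding a recovering service in lockstep. This is my own example rather than something lifted from Mads’ post; the function name and the default `base` and `cap` values are assumptions picked for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sleep duration (in seconds) per retry attempt.

    Attempt n draws uniformly from [0, min(cap, base * 2**n)], so the
    upper bound doubles each time until it reaches the cap. The random
    draw spreads retries from different clients apart in time.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * (2 ** n))) for n in range(attempts)]


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(6)):
        print(f"attempt {attempt}: sleep {delay:.3f}s before retrying")
```

The jitter is the important part: plain exponential backoff still has every client retrying at the same instants, just less often.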

The Downtime Project

…a podcast where the hosts dissect incident reports. You son of a bitch, I’m in.

You might as well timestamp it

Jerod Santo:

Storing timestamps instead of booleans, however, is one of those things I can go out on a limb and say it doesn’t really depend all that much. You might as well timestamp it. There are plenty of times in my career when I’ve stored a boolean and later wished I’d had a timestamp. There are zero times when I’ve stored a timestamp and regretted that decision.

I had to think about this a bit, but he’s right. Every time.
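
A minimal sketch of the idea, using a hypothetical `Account` record of my own invention: store a nullable timestamp and derive the boolean from it, so the "when" comes for free.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Account:
    name: str
    # None means "never deactivated"; a value records when it happened.
    deactivated_at: Optional[datetime] = None

    @property
    def is_deactivated(self) -> bool:
        # The boolean is derived from the timestamp, never stored separately.
        return self.deactivated_at is not None

    def deactivate(self, now: Optional[datetime] = None) -> None:
        # Accept an explicit time for testing; default to the current UTC time.
        self.deactivated_at = now or datetime.now(timezone.utc)


if __name__ == "__main__":
    account = Account("example")
    account.deactivate()
    print(account.is_deactivated, account.deactivated_at)
```

Anything that only wants the yes/no answer reads `is_deactivated`, while the audit question ("since when?") is already answered by the stored value.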

How to Connect to Private EC2 Instances without an AWS Bastion Host

Chris Blackden:

But wait, do you actually need a bastion host? Nope. In fact, you can use AWS Systems Manager (SSM) to take the place of a bastion host instance. You can then use the AWS CLI to connect to fleets of EC2 instances without exposing another host to the Internet!

I had no idea you could do this. Definitely making this standard part of my arsenal going forward.

Simple bank shutdown goes awry leaving customers without account access

Jacob Kastrenakes at The Verge:

The online bank Simple shut down on Saturday and was supposed to seamlessly transition customers’ accounts over to its parent company, BBVA. But instead, many users found themselves unable to access their bank account at all, as BBVA’s website returned an assortment of error messages, from “system error” to warnings that their account information was mismatched.

I would love to be a fly on the wall for this post-mortem. How do you not transition customers slowly over time?

Developing on Apple M1 Silicon with Virtual Environments

John Rofrano:

This means that Vagrant can control the provisioning of Docker containers just like it controls VirtualBox for provisioning virtual machines. This is somewhat of a unique use case for Docker because the intent of Docker is to provide a consistent, immutable runtime environment; not to be treated like a virtual machine. As with all technology, there are always use cases beyond the original intent and I was about to learn if this use case was viable.

It might surprise some to know that I’m not a fan of developing in containers, but this is a pretty novel setup. I still prefer to dev in my local OS, though. Trying to dev inside containers continues to feel like too much effort for too little benefit.

Grab Bag

Tetris for Game Boy Gets Online Multiplayer

Tom Nardi on Hackaday:

As explained in the video below, the adapter is essentially just a Raspberry Pi Pico paired with some level shifters so that it can talk to the Game Boy’s link port. That said, the custom PCB does implement some very clever edge connectors that let you plug it right into the Link Cable for the original “brick” Game Boy as well as the later Color and Advance variants. This keeps you from having to cut up a Link Cable just to get a male end, which is what stacksmashing had to do during the prototyping phase.

This is straight incredible.

May 20, 2021 2021-05-20T16:17:00+02:00

No Politics At Work Does Not Mean No Politics At Work

Basecamp founders Jason Fried and David Heinemeier Hansson violated a cardinal rule of the internet last week. It’s said that the internet has one main character every day, and your job is not to be it, and they failed pretty spectacularly.

In response to an internal struggle over diversity and inclusion, they publicly announced that the company would no longer allow political discussion at work and would become a “mission focused” company, following in the footsteps of Coinbase and a few others.

“No politics at work” may sound like common sense to some.

The problem is, of course, that “No Politics at Work” never really means “No Politics at Work.” What it really means is that the management will determine what is and is not politics, and will not entertain any discussion that makes them feel uncomfortable about their politics. The company will continue to have politics, will continue to be political, and will continue to advance political goals - it will just do so at the sole discretion of the management, which will exercise its power to squelch internal discussion of those politics.

DHH, for example, is a prominent entrepreneur who has testified in front of Congress on political topics that are directly related to his business, he regularly opines on political topics on Twitter, and maintains a blog where he writes on these topics even more extensively than he tweets. His actions in the public sphere make Basecamp an inherently political organisation. He would not have the public stature he does if not for the company and for the fruits born of its employees.

What this really means is that DHH and Fried have decided that they will only engage in politics that make themselves, personally, comfortable. DHH said as much himself in this blog post.

I’ve read some opinions on all of this that charge that facilitating these kinds of discussions, however acrimonious or uncomfortable or unresolved, is actually good, because a lot of life right now is acrimonious, uncomfortable, and unresolved, so work should reflect that. I can’t get behind those arguments. As I wrote in the segment posted from our internal announcement of the changes, all of that, inasmuch as it does not directly relate to the business, is already so much of everyone’s lives all the time on Twitter, Facebook, or wherever. Demanding that it also has to play out in our shared workspaces isn’t going to lead anywhere good, in my opinion.

In short, these conversations made him uncomfortable so he banned them. To him, they are just pointless arguments on Twitter that people can opt out of by closing a browser window.

Except that what is “politics” to someone like DHH is existence to some of his workers. What he is telling his employees is “if something makes you uncomfortable, I don’t want to hear about it because it might then also make me uncomfortable.” And then, by continuing to engage in public, affirmative activism on his pet political issues, he’s showing his hand that he will only engage in politics that he can do from a place of comfort.

The “funny” names list understandably made some employees uncomfortable and, rather than hear about how and why that was the case because it would then make him also uncomfortable, he told everyone to shut up and get back to work, invalidating their concerns.

May 6, 2021 2021-05-06T13:30:00+02:00

ACNC Weekly #20

Welcome to All Cloud, No Cattle Weekly #20.

Tech

HAProxy Forwards Over 2 Million HTTP Requests per Second on a Single Arm-based AWS Graviton2 Instance

Willy Tarreau at HAProxy:

Yes, you’ve read the title of this blog post right. HAProxy version 2.3, when tested on Arm-based AWS Graviton2 instances, reaches 2.04-million requests per second!
HAProxy 2.4, which is still under development, surpasses this, reaching between 2.07 and 2.08 million requests per second.

This is just downright insane.

Migrating Millions of Concurrent Websockets to Envoy

The Slack Engineering blog:

While we have been using HAproxy since the beginning of Slack and knew how to operate it at scale, there were some operational challenges that made us consider alternatives, like Envoy Proxy.

The Slack team does a great job outlining the challenges they encountered with their original HAProxy setup, the solutions offered by replacing it with Envoy, and the processes needed to make the handover. Great work.

Flipr: Making Changes Quickly and Safely at Scale

Andy Maule at Uber:

Uber’s many software systems require a high volume of changes every day. Because of our systems’ size and complexity, it is a significant challenge to implement these changes without unintended consequences, ultimately slowing down developer productivity. Flipr is a big part of Uber’s solution to solving this problem. Flipr is a tool that we created for dynamic configuration management, such as feature flags, allowlists, incremental rollout, and other advanced use cases.

I first used feature flagging and dynamic config at Bypass, where we had a very similar system that we called Flippy. This is really turning out to be table stakes in any sufficiently advanced system these days.
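
For anyone who hasn’t built one, the core of an incremental rollout is small. The sketch below is a generic illustration, not how Flipr or Flippy actually work: the function name, the choice of SHA-256, and the hundredth-of-a-percent bucket granularity are all my assumptions.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Decide whether a user falls inside a percentage rollout.

    The flag name and user id are hashed into a stable bucket in
    0..9999, so the same user always gets the same answer for a given
    flag, and raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    enabled = sum(in_rollout("new-checkout", u, 5) for u in users)
    print(f"{enabled} of {len(users)} users see the flag at 5%")
```

Hashing the flag name together with the user id means different flags pick different slices of users, so the same unlucky 1% doesn’t get every experiment first.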

Disasters I’ve seen in a microservices world

João Alves:

Most engineers forgot, though, that while solving an organizational problem at the software architecture’s level, they also introduced a lot of complexity. The distributed systems fallacies became more and more evident and quickly were a headache for those teams. Even for companies that were already doing client/server architectures where they already existed, this exploded in their faces once they had 10+ moving pieces in their systems.

Distributed systems are hard, whether they’re microservices or just distributed monoliths. João’s list of disasters here is a great discussion about the pitfalls of various approaches.

Management

Executives don’t decide if the company culture is good. Employees do.

Charlie Warzel:

There’s the glossy, official, Comms Department-approved culture — and then there’s the real, lived experience of showing up every day and working at a place. If the difference between those two versions is large enough, the result is generally serious, sustained, employee-management resentment. Let’s call that “culture gap.”

Spot-on analysis by Warzel, with a deep dive into the meaning behind blowups like Basecamp and Coinbase.

Grab Bag

Webcurios is back

Webcurios is an on-again-off-again collection of links dating back ages, and yet another inspiration for ACNC.

Why does HTML think “chucknorris” is a color?

…the principle of most astonishment?

May 6, 2021 2021-05-06T13:26:00+02:00

ACNC Weekly #19

Welcome to All Cloud, No Cattle Weekly #19.

Tech

How to Successfully Hand Over Systems

SoundCloud’s Aleksandra Gavrilovska:

Who will take ownership of the systems that were owned by a team that doesn’t exist anymore or that are better suited to be owned by another team? It’s in everyone’s interest that the ownership be given to a team familiar with the system’s domain, so that they can continue the maintenance and evolution.

Handing over systems is one of the most important tasks we do, and often one of the hardest. It’s vitally important that we get these right, for the health of the overall system. SoundCloud has some great ideas in here.

Cryptocurrency is an Abject Disaster

Drew DeVault:

Cryptocurrency problems are more subtle than outright abuse, too. The integrity and trust of the entire software industry has sharply declined due to cryptocurrency. It sets up perverse incentives for new projects, where developers are no longer trying to convince you to use their software because it’s good, but because they think that if they can convince you it will make them rich.

This is a great read about the many, many faults of cryptocurrency.

How a WhatsApp status loophole is aiding cyberstalkers

Louisa Stockley on traced.app:

There is, however, nothing to stop someone who wants to track an ex, a girl- or boyfriend, a spouse, from using one of these apps.

When nearly all of your engineers are from the same background, they don’t have the experience necessary to know how other people will abuse new features. This is yet another example.

SRE Case Study: Mysterious Traffic Imbalance

Charles Li at eBay:

It had been working like this for many years until mid-2007, when the Site Reliability Engineering (SRE) team noticed that Denver started getting slightly more traffic than Miami. The discrepancy was under 1%, which wasn’t significant enough to cause any impact. It just seemed to be strange as it never happened before, so the SRE team opened a case and started to monitor the traffic distribution more closely.
After several weeks of monitoring, the team clearly observed a trend that the Internet traffic from the users was shifting to Denver slowly and consistently, from 1% to 2% to 3%. At this point, the severity level of the case was raised and more engineers were grouped together to figure out the root cause.

This is an oldie but a goodie that resurfaced in a couple of forums lately. It’s really amazing how things far, far outside of your control or scope can have a profound impact on your operations.

The top 3 mistakes companies make with SLOs, SLAs, and SLIs

The Cortex engineering blog:

We see teams fall into a few common traps with SLOs, SLIs, and SLAs, particularly when they’re just starting out. In this article, we’ll first define these three acronyms (it’s easy to get confused!) and show you how to avoid the mistakes other teams make.

This is a great, easily digestible explanation of these concepts.
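
To make the relationship between the terms concrete, here is a rough sketch of my own, not taken from the Cortex article: the SLI is a measured ratio of good events, the SLO is the target for that ratio, and the gap between the target and 100% is commonly called the error budget. The function names and the request-count framing are assumptions for illustration.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the measured fraction of good events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no bad events yet, 0.0 means the budget is exactly
    spent, and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    good, total, target = 999_500, 1_000_000, 0.999
    print(f"SLI: {sli(good, total):.4%} against an SLO of {target:.1%}")
    print(f"Error budget remaining: {error_budget_remaining(good, total, target):.0%}")
```

An SLA is the contractual layer on top of this, which is why it has no place in the code.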

Grab Bag

A Casino Gets Hacked Through a Fish-Tank Thermometer

Gene Marks:

“The attackers used that (a fish-tank thermometer) to get a foothold in the network,” she recounted. “They then found the high-roller database and then pulled that back across the network, out the thermostat, and up to the cloud.”

Not gonna lie, that’s more than a little impressive. If you can pull off a hack like this, you kinda deserve it.

Beavers chew through 4.5-inch thick tube, disrupting internet

Tessa Vikander for CTVNews Vancouver:

Beavers are being blamed for an internet, cellphone and cable TV outage in a remote town in northern B.C.
Tumbler Ridge, a four-hour drive north-east of Prince George, with a population of 1,982, is in the midst of what is now a two-day-long Telus coverage outage due to damage on local fibre cables by resident wildlife.

Nature is healing.

Apr 29, 2021 2021-04-29T10:50:00+02:00

ACNC Weekly #18: Celle-bitten

Welcome to All Cloud, No Cattle Weekly #18: Celle-bitten

Tech

Exploiting vulnerabilities in Cellebrite UFED and Physical Analyzer from an app’s perspective

moxie0 on the Signal blog:

In completely unrelated news, upcoming versions of Signal will be periodically fetching files to place in app storage. These files are never used for anything inside Signal and never interact with Signal software or data, but they look nice, and aesthetics are important in software.

chef’s kiss

This is an insane story, and I fucking love how Signal has chosen to handle this situation in light of Cellebrite’s notoriety.

Re: [PATCH] SUNRPC: Add a check for gss_release_msg

Greg Kroah-Hartman, publicly shaming University of Minnesota “security researchers”:

Our community does not appreciate being experimented on, and being “tested” by submitting known patches that are either do nothing on purpose, or introduce bugs on purpose. If you wish to do work like this, I suggest you find a different community to run your experiments on, you are not welcome here.
Because of this, I will now have to ban all future contributions from your University and rip out your previous contributions, as they were obviously submitted in bad-faith with the intent to cause problems.

Speaking of poor form by security researchers, this is a remarkably poor choice by the team at the University of Minnesota, and it had disastrous results for the entire university. Don’t be like them.

As an aside, I love that Greg includes a gag about top posting, along with a link to a Daring Fireball post from 2007, at the top of his emails.

The FBI wanted to unlock the San Bernardino shooter’s iPhone. It turned to a little-known Australian firm.

Ellen Nakashima and Reed Albergotti in the Washington Post:

The iPhone used by a terrorist in the San Bernardino shooting was unlocked by a small Australian hacking firm in 2016, ending a momentous standoff between the U.S. government and the tech titan Apple.
Azimuth Security, a publicity-shy company that says it sells its cyber wares only to democratic governments, secretly crafted the solution the FBI used to gain access to the device, according to several people familiar with the matter. The iPhone was used by one of two shooters whose December 2015 attack left more than a dozen people dead.

Apparently it’s iPhone security week at ACNC. Our entire industry is built on responsible security practices by both practitioners and security analysts, and firms like Azimuth and Cellebrite subvert this when they do not disclose the vulnerabilities that they uncover.

It’s telling to look at who their customers are.

Also, that’s some serious title gore from the Washington Post.

Disadvantages of Pull Requests

Tomasz Wróbel on the Arkency blog:

Sometimes it’s unavoidable (in a low-trust environment), but often people work with PRs just because everyone else does. And nobody ever got fired for it.
But what are the costs of working in such style? And what are the alternatives?

A lot of great criticism of the Pull Request culture, and I can’t really find any serious fault with any of his points. The big takeaway is to make small changes at high frequency, and I think we (especially those of us in the SRE space) are all on board with that.

“Please don’t upgrade docker without asking first”

Randy Fay, in a ticket filed against the Docker roadmap back in December:

Please don’t auto-upgrade Docker Desktop. Or give us an option to disable upgrades like this. It’s fine to prepare the new patch. It’s fine to simplify the process. But don’t just install without giving us some recourse.
Despite the most heroic efforts of the Docker team, a new release may have new bugs. In software we deal with this all the time. All of us face it, and to move forward there has to be some risk.
However, if there’s no way to stop auto-upgrades, there’s no easy way to go back to the working version.

It boggles the mind that they thought this was a good idea, and that they also thought it was a good idea to make people pay to opt out of automatic updates in the first place.

They backed off to a somewhat more reasonable position: now you will only get nagged if you don’t update, rather than forced to update, but only paying users can disable the nags. Their justification, that “if you care enough about reliability to disable updates over it, you’re a commercial user who should be paying anyway,” is a bit ham-fisted, though.

Grab Bag

They Hacked McDonald’s Ice Cream Machines—and Started a Cold War

Andy Greenberg at Wired:

And this opaque user-unfriendliness is far from the only problem with the machines, which have gained a reputation for being absurdly fickle and fragile. Thanks to a multitude of questionable engineering decisions, they’re so often out of order in McDonald’s restaurants around the world that they’ve become a full-blown social media meme. (Take a moment now to search Twitter for “broken McDonald’s ice cream machine” and witness thousands of voices crying out in despair.)

Dairy Queen ice cream is better than McDonald’s ice cream, anyway.

Apr 22, 2021

ACNC Weekly #17

Welcome to All Cloud, No Cattle Weekly #17.

Tech

Flight loads miscalculated because women using ‘Miss’ were treated as children

Thomas Claburn for The Register:

The error occurred, according to a report [PDF] released on Thursday by the UK Air Accidents Investigation Branch (AAIB), because the check-in software treated travelers identified as “Miss” in the passenger list as children, and assigned them a weight of 35 kg (~77 lbs) instead of 69 kg (~152 lbs) for an adult.
The AAIB report attributes the error to cultural differences in how the term Miss is understood.

We talk a lot about how diversity and inclusion is important because it ensures that everyone gets a fair chance in life regardless of their background, race, religion, sexuality, creed, or otherwise. This is why D&I is important on the individual level.

But it’s also important as a gestalt: Monocultures are bad, as monocultures are not universal. This development team included people who speak an English dialect that uses “Miss” mostly in reference to children, or who were non-native speakers and misunderstood the definition of “Miss.” Having more diverse members who speak and understand multiple variants of English dramatically increases the chances that someone says “Hey, uuuuh… so, I don’t think ‘Miss’ means what you think it means.”

DNS propagation does not exist

Ruurtjan Pul, at the aptly-named nslookup.io:

A widespread fallacy among IT professionals is that DNS propagates through some network. So widespread in fact, that there are a couple of sites dedicated to visualizing the geographic propagation of DNS records. But DNS propagation does not exist.

DNS is just other people’s caches.
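To make that concrete, here is a toy sketch in Python of two resolvers that each keep their own cache. Everything in it is an assumption for illustration: the `TtlCache` class, the `resolve` helper, the dictionary standing in for an authoritative zone, and the 300-second TTL. A real resolver is far more involved, but the core behaviour is the same: nothing is pushed outward, and each cache simply holds its answer until the TTL runs out.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy resolver cache: each record is kept until its own TTL runs out."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._records: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, value: str, ttl_seconds: float) -> None:
        # Store the answer together with the moment it stops being valid.
        self._records[name] = (value, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._records.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget it so the next lookup goes back to the source.
            del self._records[name]
            return None
        return value


def resolve(cache: TtlCache, name: str, zone: Dict[str, str], ttl_seconds: float) -> str:
    """Answer from the cache if possible, otherwise ask the (fake) authoritative zone."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    value = zone[name]
    cache.put(name, value, ttl_seconds)
    return value
```

If resolver A looked the name up just before the record changed and resolver B looks it up just after, A keeps serving the old address until its TTL expires while B has the new one immediately. That gap is what the "propagation" maps are really showing.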

The 5 characteristics of high reliability organizations

The concept of HRO has helped engineers, system managers and operators at all levels across industries better understand risk and improve their systems. The result is a significant decrease in system failures through the application of the HRO principles.

TW: Death, first responder emotional trauma, etc.

While this article specifically covers reliability of Emergency Medical Services, the principles discussed are perfectly applicable to our line of work as well.

Learning from incidents: getting Sidekiq ready to serve a billion jobs

Nakul Pathak at Scribd:

Scribd currently serves hundreds of Sidekiq jobs per second and has served 25 billion jobs since its adoption 2 years ago. Getting to this scale wasn’t easy. In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

I have a lot of emotional scarring from Sidekiq, which we used extensively at Bypass. This basically reads like it could have been half of our after-action reports. Datadog APM, let alone APM for Sidekiq, did not exist back then, and Sidekiq itself did not have great observability.

Probably my biggest single technical contribution at that job was to write a daemon that sat around and monitored the queues directly in Redis and exported that to Datadog as metrics.
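For the curious, the core of that idea fits in a few lines. This is a rough Python sketch rather than the original daemon. It assumes Sidekiq's default Redis layout with no key namespace (queue names in the `queues` set, each queue stored as a list under `queue:<name>`) and a client object exposing `smembers` and `llen`, as redis-py does. The function name is made up, and shipping the numbers to Datadog is left out entirely.

```python
from typing import Any, Dict


def sidekiq_queue_depths(client: Any) -> Dict[str, int]:
    """Return {queue name: number of waiting jobs} read straight from Redis.

    Assumes Sidekiq's default, un-namespaced key layout.
    """
    depths: Dict[str, int] = {}
    for raw_name in client.smembers("queues"):
        # redis-py returns bytes unless decode_responses=True was set.
        name = raw_name.decode() if isinstance(raw_name, bytes) else raw_name
        depths[name] = client.llen(f"queue:{name}")
    return depths
```

A daemon would call this on a timer and emit each depth as a gauge. Queue depth alone says nothing about latency or retries, but it is usually the first signal that workers are falling behind.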

Python Packaging Tools: Security Work And An Open Position

Sumana Harihareswara:

New York University (specifically Professor Justin Cappos) and I have successfully asked the US National Science Foundation for a grant to improve Python packaging security. The NSF is awarding NYU $800,000 over two years, from mid-2021 to mid-2023, to further improve the pip dependency resolver and to integrate The Update Framework further into the packaging toolchain. I shared more details in this announcement on an official Python packaging forum.

This is great news! Python, in my not so humble opinion, already has one of the more easily understood packaging frameworks. It’s really good to see yet more work poured into it, especially on the security front.

BBEdit - Free Text Editor (Apr 12, 1992)

Rich Siegel, writing to comp.sys.mac.announce in 1992:

This is the first public release of BBEdit, which is a free text editor that has been under development and extensive in-house testing for the past two years.
BBEdit is 32-bit clean, compatible with any Macintosh running system version 6.0 or later, and when running under System 7.0, takes specific advantage of new features to enhance performance and appearance.

One of the true Mac gems turns 29 this week.

Grab Bag

Supermarkets cheesed off after dairy supplier is hacked

DutchNews.nl:

The hack took place last Monday and took down the order system, forcing the company to return to pen and paper to process orders and regulate stocks, the Telegraaf reported. Director Toon Verhoeven told the paper the company had been attacked by ransomware but declined to give any further details.

This happened back on April 10 or 11 and today (the 15th), all three of my closest AHs still have very, very limited cheese selections.

Personal Note

I have some availability on my calendar for SRE and DevOps consulting work. If you’re an early-stage or scale-up-stage startup, give me a holler via the links in the sidebar.

Apr 15, 2021
